On Estimating Frequency Moments of Data Streams

Authors

  • Sumit Ganguly
  • Graham Cormode
Abstract

Space-economical estimation of the $p$th frequency moment, defined as $F_p = \sum_{i=1}^{n} |f_i|^p$ for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computation in general. The concept of $p$-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a $p$-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied by Li [13]. In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(1/\epsilon^2)$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde{O}(1/\epsilon^{2+p})$ to estimate $F_p$ to within $1 \pm \epsilon$ factors and require expected time $O(\log F_1 \log \frac{1}{\delta})$ to process each update. Thus, our technique trades an $O(1/\epsilon^p)$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.
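
To make the definition of the frequency moment concrete, the following is a minimal Python sketch that computes it exactly by storing the whole frequency vector. It is only the linear-space baseline that streaming algorithms try to avoid, not the algorithm described in the abstract. The function name `exact_frequency_moment` and the `(item, delta)` update format are assumptions made for illustration.

```python
from collections import defaultdict


def exact_frequency_moment(updates, p):
    """Return F_p = sum_i |f_i|^p for a stream of (item, delta) updates.

    Hypothetical helper for illustration: it keeps one counter per distinct
    item, so its space is linear in the number of distinct items.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    freq = defaultdict(int)
    for item, delta in updates:
        freq[item] += delta  # delta may be negative (a deletion)
    # Items whose frequency has returned to zero contribute nothing.
    return sum(abs(f) ** p for f in freq.values() if f != 0)


# Final frequencies: a = 2, b = 0, c = 4.
stream = [("a", 3), ("b", 1), ("a", -1), ("c", 4), ("b", -1)]
f1 = exact_frequency_moment(stream, 1)        # |2| + |4|
f_half = exact_frequency_moment(stream, 0.5)  # sqrt(2) + sqrt(4)
```

A sketch-based estimator would replace the `freq` dictionary with a small summary structure and return a value within a $1 \pm \epsilon$ factor of this exact result with high probability.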

Similar resources

Better Bounds for Frequency Moments in Random-Order Streams

Estimating frequency moments of data streams is a very well studied problem [1–3,9,12] and tight bounds are known on the amount of space that is necessary and sufficient when the stream is adversarially ordered. Recently, motivated by various practical considerations and applications in learning and statistics, there has been growing interest in studying streams that are randomly ordered [3,4...

Estimating Entropy of Data Streams Using Compressed Counting

The Shannon entropy is a widely used summary statistic, for example in network traffic measurement, anomaly detection, neural computations, and the analysis of spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the αth frequency moments and approach Shannon entropy a...

Estimating Hybrid Frequency Moments of Data Streams

We consider the problem of estimating hybrid frequency moments of two-dimensional data streams. In this model, data is viewed as organized in matrix form $(A_{i,j})_{1 \le i,j \le n}$. The entries $A_{i,j}$ are updated coordinate-wise, in arbitrary order and possibly multiple times. The updates include both increments and decrements to the current value of $A_{i,j}$. The hybrid frequency moment $F_{p,q}(A)$ is defined...

A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Compressed Counting (CC) was recently proposed for approximating the αth frequency moments of data streams, for 0 < α ≤ 2. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on symmetric stable random projections, especially as α → 1. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measure...

Improving Compressed Counting

Compressed Counting (CC) [22] was recently proposed for estimating the αth frequency moments of data streams, where 0 < α ≤ 2. CC can be used for estimating Shannon entropy, which can be approximated by certain functions of the αth frequency moments as α → 1. Monitoring Shannon entropy for anomaly detection (e.g., DDoS attacks) in large networks is an important task. This paper presents a new a...

Estimating Frequency Moments of Data Streams Using Random Linear Combinations

The problem of estimating the $k$th frequency moment $F_k$, for any nonnegative $k$, over a data stream by looking at the items exactly once as they arrive was considered in a seminal paper by Alon, Matias and Szegedy [1, 2]. The space complexity of their algorithm is $\tilde{O}(n^{1-1/k})$. For $k > 2$, their technique does not apply to data streams with arbitrary insertions and deletions. In this paper, we present...

Publication date: 2007